Abstract: The growing volume of data creates a need for data analysis tools that discover regularities in these data. Data mining has emerged as a discipline that contributes tools for data analysis, discovery of hidden knowledge, and autonomous decision making in many application domains. The purpose of this study is to compare the performance of two data mining techniques, viz. factor analysis and multiple linear regression, for different sample sizes on three unique sets of data. The performance of the two techniques is compared on the following parameters: mean square error (MSE), R-square, adjusted R-square, condition number, root mean square error (RMSE), number of variables included in the prediction model, modified coefficient of efficiency, F-value, and test of normality. These parameters have been computed using tools such as SPSS, XLstat, Stata, and MS-Excel. For all the given datasets, factor analysis outperforms multiple linear regression. However, the absolute value of prediction accuracy varied between the three datasets, indicating that the data distribution and data characteristics play a major role in choosing the correct prediction technique.

1. Data Introduction

A basic assumption of the general linear regression model is that there is no correlation (no multicollinearity) between the explanatory variables. When this assumption is not satisfied, the least squares estimators have large variances, become unstable, and may have the wrong sign. Therefore, we resort to biased regression methods, which stabilize the parameter estimates [17]. The data sets chosen for this study have a combination of the following characteristics: few predictor variables, many predictor variables, highly collinear variables, highly redundant variables, and the presence of outliers.
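
The instability of least squares under multicollinearity can be illustrated with a small simulation. The sketch below is not taken from the study: the data-generating model, the noise levels, and the function name `slope_spread` are assumptions chosen only for illustration. It refits the same regression on repeated noisy samples and measures how much the first slope estimate varies when the second predictor is either independent of the first or a near-copy of it.

```python
import numpy as np

def slope_spread(collinear: bool, n: int = 200, trials: int = 300, seed: int = 0) -> float:
    """Standard deviation of the first OLS slope over repeated noisy samples."""
    rng = np.random.default_rng(seed)
    slopes = []
    for _ in range(trials):
        x1 = rng.normal(size=n)
        # Hypothetical design: x2 is either a near-copy of x1 or independent of it.
        x2 = x1 + 0.05 * rng.normal(size=n) if collinear else rng.normal(size=n)
        y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes.append(coef[1])
    return float(np.std(slopes))

spread_independent = slope_spread(collinear=False)
spread_collinear = slope_spread(collinear=True)
print(spread_independent, spread_collinear)
```

Under these assumptions the spread of the slope estimate is much larger in the collinear case, which is the behaviour that motivates the biased regression methods mentioned above.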

The three data sets used in this paper, viz. the marketing, bank, and Parkinson's telemonitoring data sets, are taken from [8], [9], and [10] respectively.



From the foregoing, it can be observed that each of these three sets has unique properties. The marketing dataset consists of 14 demographic attributes. The dataset is a good mixture of categorical and continuous variables with a lot of missing data. This is characteristic of data mining applications.

Fig 1: Box Plot of Marketing Dataset



Fig 2: Box Plot of Parkinson Dataset 



The bank dataset is synthetically generated from a simulation of how bank customers choose their banks. Tasks are based on predicting the fraction of bank customers who leave the bank because of full queues. Each bank has several queues that open and close according to demand. The tellers vary in effectiveness, and customers may change queue if their patience expires.

Fig 3: Box Plot of Bank Dataset

In the rej prototasks, the objective is to predict the rate of rejections, i.e., the fraction of customers that are turned away from the bank because all the open tellers have full queues. This dataset consists of 32 continuous attributes and has 4500 records.

The Parkinson's telemonitoring data set is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes. Columns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim of the data is to predict the total UPDRS scores ('total_UPDRS') from the 16 voice measures. This is a multivariate dataset with 26 attributes and 5875 instances. All the attributes are either integer or real, with many missing and outlier values.

The box plots of the three datasets (Fig. 1 to Fig. 3) shown above display the dispersion of the variables, compare the means of the different variables, and show the outliers in the three datasets. It is therefore necessary to scale these three datasets to reduce the dispersion and bring all the variables of all datasets to the same unit of measure.

2. Prediction Techniques

There are many prediction techniques (association 
rule analysis, neural networks, regression analysis, 
decision tree, etc.) but in this study only two linear 
regression techniques have been compared. 



2.1 Multiple Linear Regression 

A multiple linear regression model maps a group of predictors x to a response variable y [4]. It is defined by the following relationship, for i = 1, 2, ..., n:

$$y_i = a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik} + e_i$$

or, equivalently, in more compact matrix terms:

$$Y = Xb + E$$

where, for all the n considered observations, Y is a column vector with n rows containing the values of the response variable; X is a matrix with n rows and k + 1 columns containing, in each column, the values of one explanatory variable for the n observations, plus a column (for the intercept) containing n values equal to 1; b is a vector with k + 1 rows containing all the model parameters to be estimated on the basis of the data: the intercept and the k slope coefficients relative to each explanatory variable. Finally, E is a column vector of length n containing the error terms. In the bivariate case the regression model was represented by a line; now it corresponds to a plane in (k + 1)-dimensional space, called the regression plane. This plane is defined by the equation

$$y_i = a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik} + u_i$$

where $y_i$ is the dependent variable, the $x_{ij}$ are the independent variables, and $u_i$ is the stochastic error term. We have compared three basic methods under this multiple linear regression technique: the full method (which uses the least squares approach), the forward method, and the stepwise approach (which used the discriminant approach or all possible subsets) [5].
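
As an illustration of the full method, the matrix form Y = Xb + E can be fitted by ordinary least squares. The following is a minimal sketch, not the procedure used in the study (which relied on SPSS, XLstat, Stata, and MS-Excel); the function names and the toy data are assumptions made for this example.

```python
import numpy as np

def fit_full_model(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return b = (intercept, slopes...) minimizing ||y - [1 X] b||^2."""
    design = np.column_stack([np.ones(len(y)), X])  # n x (k + 1), with a column of ones
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return b

def predict(b: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Apply the fitted intercept and slopes to new predictor rows."""
    return b[0] + X @ b[1:]

# Toy data (made up for illustration): y = 1 + 2*x1 - 3*x2 with no error term.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
b_demo = fit_full_model(X_demo, y_demo)
print(b_demo)
```

Because the toy response contains no error term, the fitted vector recovers the intercept and the two slopes up to floating-point precision. The forward and stepwise methods would repeat such fits on subsets of the columns of X.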

2.2 Factor Analysis 

Factor analysis attempts to represent a set of observed variables $X_1, X_2, \ldots, X_n$ in terms of a number of 'common' factors plus a factor which is unique to each variable. The common factors (sometimes called latent variables) are hypothetical variables which explain why a number of variables are correlated with each other: it is because they have one or more factors in common [7].

Factor analysis is basically a one-sample procedure [6]. We assume a random sample $y_1, y_2, \ldots, y_n$ from a homogeneous population with mean vector $\mu$ and covariance matrix $\Sigma$. The factor analysis model expresses each variable as a linear combination of underlying common factors $f_1, f_2, \ldots, f_m$, with an accompanying error term to account for that part of the variable that is unique (not in common with the other variables). For $y_1, y_2, \ldots, y_p$ in any observation vector y, the model is as follows:

$$\begin{aligned}
y_1 - \mu_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1m} f_m + \varepsilon_1 \\
y_2 - \mu_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2m} f_m + \varepsilon_2 \\
&\;\;\vdots \\
y_p - \mu_p &= \lambda_{p1} f_1 + \lambda_{p2} f_2 + \cdots + \lambda_{pm} f_m + \varepsilon_p
\end{aligned}$$

Ideally, m should be substantially smaller than p; otherwise we have not achieved a parsimonious description of the variables as functions of a few underlying factors. We might regard the f's in the equations above as random variables that engender the y's. The coefficients $\lambda_{ij}$ are called loadings and serve as weights, showing how each $y_i$ individually depends on the f's. With appropriate assumptions, $\lambda_{ij}$ indicates the importance of the jth factor $f_j$ to the ith variable $y_i$ and can be used in the interpretation of $f_j$. We describe or interpret $f_2$, for example, by examining its coefficients $\lambda_{12}, \lambda_{22}, \ldots, \lambda_{p2}$. The larger loadings relate $f_2$ to the corresponding y's. From these y's, we infer a meaning or description of $f_2$. After estimating the $\lambda_{ij}$'s, it is hoped that they will partition the variables into groups corresponding to factors. There is a superficial resemblance to multiple linear regression, but there are fundamental differences. First, the f's in the equations above are unobserved; second, the equations above represent only one observation vector, whereas multiple linear regression depicts all n observations.
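
To make the role of the loadings concrete, the model can be simulated. The sketch below draws observation vectors from a factor model with hypothetical loadings and unique variances (none of these numbers come from the study). It assumes the usual conventions that the common factors are uncorrelated with unit variance and that the error terms are uncorrelated with each other and with the factors; under those assumptions the covariance matrix implied by the model is the loading matrix times its transpose plus a diagonal matrix of unique variances.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, m = 50_000, 4, 2  # observations, observed variables, common factors

# Hypothetical loadings (p x m) and unique variances, chosen for illustration.
loadings = np.array([[0.9, 0.0],
                     [0.8, 0.1],
                     [0.1, 0.7],
                     [0.0, 0.6]])
unique_var = np.array([0.19, 0.35, 0.50, 0.64])
mu = np.array([10.0, 20.0, 30.0, 40.0])

f = rng.normal(size=(n, m))                          # unobserved common factors
eps = rng.normal(size=(n, p)) * np.sqrt(unique_var)  # unique (error) terms
y = mu + f @ loadings.T + eps                        # one observation vector per row

implied_cov = loadings @ loadings.T + np.diag(unique_var)
sample_cov = np.cov(y, rowvar=False)
max_gap = float(np.abs(sample_cov - implied_cov).max())
print(max_gap)
```

The first two variables load mainly on the first factor and the last two on the second, so the simulated covariance matrix shows two correlated groups, which is the kind of partition the loadings are hoped to reveal.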

There are a number of different varieties of factor analysis; the comparison here is limited to principal component analysis, generalized least squares, and maximum likelihood estimation.
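
One way the principal component variant can be used for prediction is principal component regression: the predictors are standardized, projected onto their leading principal components, and the response is regressed on the resulting scores. The sketch below is an assumption about how such a model could be built, not a description of the software procedure used in the study; the function name `pcr_fit_predict` and the random data are hypothetical.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_new, n_components):
    """Regress y on the first n_components principal component scores of X."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Z = (X_train - mean) / std                        # standardized predictors
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # rows of Vt are component directions
    V = Vt[:n_components].T                           # p x n_components
    design = np.column_stack([np.ones(len(y_train)), Z @ V])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    Z_new = (X_new - mean) / std
    return coef[0] + (Z_new @ V) @ coef[1:]

# Made-up data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + 0.1 * rng.normal(size=60)

pred_one = pcr_fit_predict(X, y, X, n_components=1)
pred_all = pcr_fit_predict(X, y, X, n_components=3)

# With every component retained, the fit coincides with ordinary least squares.
ols_design = np.column_stack([np.ones(60), X])
ols_coef, *_ = np.linalg.lstsq(ols_design, y, rcond=None)
pred_ols = ols_design @ ols_coef
```

Keeping fewer components than predictors trades some fit for stability, which is the parsimony argument made above for choosing m substantially smaller than p.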

3. Related Work 

There are many data mining techniques (decision trees, neural networks, regression, clustering, etc.), but in this paper we have compared two linear techniques, viz. multiple linear regression and factor analysis. Many researchers and authors in this domain have compared various data mining techniques from varied aspects.

In 2004, Munoz et al. compared three data mining methods: linear statistical methods, the neural network method, and non-linear multivariate methods [11]. In 2008, Saikat and Jun Yan compared PCA and PLS on simulated data [12]. Munoz et al. compared logistic regression, principal component regression, and classification and regression trees with multivariate adaptive regression splines [16]. In 1999, Manel et al. compared discriminant analysis, neural networks, and logistic regression for predicting species distribution [13]. In 2005, Orsalya et al. compared ridge regression, the pairwise correlation method, forward selection, and best subset selection in a quantitative structure-retention relationship study based on multiple linear regression for predicting the retention indices of aliphatic alcohols [14]. In 2002, Huang et al. compared least squares regression, ridge regression, and partial least squares in the context of varying calibration data size, using squared prediction errors as the only model comparison criterion [15].

4. Preparation and Methodology 

Both the techniques under study are linear in nature, and the choice of technique is vital for getting significant results. When nonlinear data are fitted with a linear technique, the results obtained are biased, and when linear data are fitted with a nonlinear technique, the results have increased variance. As the techniques undertaken for this study are both linear, to get significant results we need to apply them to linear data sets. By saying that both techniques are linear regression techniques, we mean that they are linear in the parameters [1] [2]: the b's (that is, the parameters) are raised to the first power only. The model may or may not be linear in the explanatory variables, the X's. To make our data sets linear, they are preprocessed by taking the natural log of all the instances of the data sets or normalized using z-score normalization [3]. After scaling and standardizing the three datasets, it is found that skewness is reduced, as shown by the histograms of all three datasets. To demonstrate the linearity of these data sets, box plots, histograms, and the JB test (Jarque-Bera test) with its p-value (the exact significance level, or probability of committing a type I error) have been used.

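
The preprocessing steps named above can be sketched with the standard library alone. The Jarque-Bera statistic is computed from the sample skewness S and kurtosis K as JB = (n/6)(S² + (K - 3)²/4), and its asymptotic p-value comes from a chi-square distribution with two degrees of freedom. The function names and the small right-skewed sample are assumptions for illustration; the study itself used statistical packages for these computations.

```python
import math

def z_scores(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def log_transform(values):
    """Natural log of each value; only defined for strictly positive data."""
    return [math.log(v) for v in values]

def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value from chi-square with 2 d.o.f.)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square(2)
    return jb, p_value

skewed = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 30.0, 100.0]  # made-up right-skewed sample
jb_raw, p_raw = jarque_bera(skewed)
jb_log, p_log = jarque_bera(log_transform(skewed))
standardized = z_scores(skewed)
```

On this sample the log transform lowers the JB statistic, mirroring the reduction in skewness reported for the three datasets. With so few points the asymptotic p-value is only indicative.
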
After scaling and standardizing, the data sets are divided into two parts, taking 70% of the observations as the "training set" and the remaining 30% as the "test validation set" [3]. For each data set, the training set is used to build the model, and the various methods of each technique are employed. For example, in multiple linear regression (MLR), three methods are used in this study: the full model, the forward model, and the stepwise model. The model is validated using the test validation data set, and the results are presented using ten goodness-of-fit criteria. The two techniques are compared, both within and between techniques, for their performance on the three unique datasets.
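
A 70/30 partition of this kind can be sketched as follows. The text does not say how observations were assigned to the two parts, so the random shuffle, the seed, and the function name are assumptions made for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle row indices and split them into training and test-validation parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # assumed: random assignment with a fixed seed
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

records = list(range(4500))  # stand-in for the 4500 bank records
train_rows, test_rows = train_test_split(records)
print(len(train_rows), len(test_rows))
```

The model is then built on `train_rows` only, and every goodness-of-fit criterion is evaluated on `test_rows`.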

5. Interpretation and Findings 

Refer to Table 1 and Table 2 given below.

5.1 Interpreting Marketing Dataset 

In the marketing dataset, the R² and Adj. R² of the full model showed good explanatory power, i.e., 0.47, which is higher than for both the stepwise and forward models.

On the basis of this explanatory power, we can say that among all the methods of multiple linear regression the full model was the best method for data mining purposes, since 47% of the variation in the dependent variable was explained by the independent variables. But an explanatory power of 0.47 is not up to the mark and calls for a regression model other than multiple regression for this data set, since 53% of the total variation was left unexplained. So, within the multiple regression techniques the full model was the best, but not up to the mark. The value of R² suggests using another regression model.

Table 1



The inclusion of additional independent variables (either relevant or irrelevant) in a multiple regression model never decreases the explanatory value, R². In this case we can use another measure, the adjusted R² (Adj. R²), which accounts for the effect of new explanatory variables in the model, since it incorporates the degrees of freedom of the model, i.e., the denominators of the explained and unexplained variation [18]. The expression for the adjusted coefficient of multiple determination is:

$$\text{Adj. } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}, \qquad \text{equivalently} \qquad \text{Adj. } R^2 = 1 - \frac{\sum \hat{u}_i^2/(n-k)}{\sum (y_i - \bar{y})^2/(n-1)}$$

where n is the number of observations, k is the number of estimated parameters (including the intercept), and the $\hat{u}_i$ are the residuals.

If n is large, Adj. R² and R² will not differ much. But with small samples, if the number of regressors X's is large in relation to the sample observations, Adj. R² will be much smaller than R² and can even assume negative values, in which case Adj. R² should be interpreted as being equal to zero.

For the marketing data set, under all methods of multiple linear regression Adj. R² was found to be similar to R², which means the sample size is sufficiently large as required for data mining purposes [19].

Table 2



The R² for factor analysis on the marketing dataset was found to be around 0.58 under every method, so all methods of factor analysis have equal explanatory power. Moreover, under all methods, viz. PCR, maximum likelihood, and GLS, the explained variation is 58% of the total variation in the dependent variable, which signifies that factor analysis extraction is better than multiple linear regression. R² can also be expressed through the following notation:

$$R^2 = \frac{ESS}{TSS}$$

where TSS = Explained Sum of Squares (ESS) + Residual Sum of Squares (RSS).

The Adj. R², i.e., R² adjusted for the inclusion of new explanatory variables, was found to be 0.56, slightly less than R². The 58% of variation captured by the regression reflects the overall goodness of fit of the regression line to the marketing dataset under factor analysis.

So, on the basis of the first-order statistical test (R²), we can conclude that the factor analysis technique is better than the multiple regression technique in explanatory power.
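
The two explanatory-power measures can be computed directly from actual and predicted values. This is a minimal sketch with made-up numbers; it uses R² = 1 - RSS/TSS, which equals ESS/TSS whenever TSS = ESS + RSS, and the adjusted form 1 - (1 - R²)(n - 1)/(n - k), where treating k as the number of estimated parameters including the intercept is an assumption of this example.

```python
def r_squared(actual, predicted):
    """Share of the total variation in `actual` explained by `predicted`."""
    mean_y = sum(actual) / len(actual)
    tss = sum((y - mean_y) ** 2 for y in actual)
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - rss / tss  # equals ESS / TSS whenever TSS = ESS + RSS

def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for model size; k counts the intercept (assumed)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)

# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(actual), k=2)
print(r2, adj_r2)
```

For a fixed R², the adjusted value moves closer to R² as n grows relative to k, which is why the two measures nearly coincide on large samples.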

The mean square error (MSE) criterion is a combination of the unbiasedness and minimum variance properties. An estimator is a minimum-MSE estimator if it has the smallest MSE, defined as the expected value of the squared difference between the estimator $\hat{b}$ and the true population parameter b:

$$\mathrm{MSE}(\hat{b}) = E(\hat{b} - b)^2$$

It can be proved that this is equal to

$$\mathrm{MSE}(\hat{b}) = \mathrm{Var}(\hat{b}) + \mathrm{bias}^2(\hat{b})$$
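
The decomposition can be checked numerically. The sketch below uses a deliberately biased estimator of a known mean (a sample mean shrunk by a factor of 0.9); the estimator, the true value, and the sample sizes are hypothetical and serve only to show that the empirical MSE splits into variance plus squared bias.

```python
import random
import statistics

def mse_decomposition(estimates, true_value):
    """Return (mse, variance, squared_bias) for a list of estimates."""
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    variance = statistics.pvariance(estimates)
    bias_sq = (statistics.fmean(estimates) - true_value) ** 2
    return mse, variance, bias_sq

rng = random.Random(0)
true_b = 5.0
estimates = []
for _ in range(2000):
    sample = [rng.gauss(true_b, 2.0) for _ in range(20)]
    estimates.append(0.9 * statistics.fmean(sample))  # deliberately biased estimator

mse, variance, bias_sq = mse_decomposition(estimates, true_b)
print(mse, variance + bias_sq)
```

An unbiased estimator would make the squared-bias term vanish, leaving MSE equal to the variance alone.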

The MSE criterion for unbiasedness and minimum variance was found to increase across the multiple linear regression models. The MSE of the full method is lower than that of all the other models, which further means that the full model of multiple linear regression on the marketing dataset has less bias and less variance.

Minimum variance also increases the probability of unbiasedness and gives better explanatory power, in line with R², on the marketing dataset.

The comparison between the two techniques, multiple linear regression and factor analysis, showed that the MSE of the factor analysis models is significantly different, which signifies that under factor analysis all b's are unbiased but have large variance. Because of this large variance, the probability of unbiasedness increases, which produces a contradictory result about the explanatory power of the factor analysis methods. Since the factor analysis methods may have questionable MSE values, a further measure derived from MSE, the root mean square error (RMSE), was used in the study.

RMSE was found to be considerably similar across the methods of both techniques. Because RMSE varies little between MLR and factor analysis on the marketing dataset, it can be stated that both techniques carry equal weight for consideration.

A common measure used to compare the prediction performance of different models is the mean absolute error (MAE).

If $Y^p$ is the predicted dependent variable and $Y$ is the actual dependent variable, then the MAE can be computed as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - Y_i^{p}\right|$$
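
Both prediction-error measures used in this section, MAE and RMSE, can be computed from the test-set predictions in a few lines. This is a minimal sketch with made-up values; the function names are assumptions, and MAE is taken here as the average absolute difference between actual and predicted values.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def root_mean_square_error(actual, predicted):
    """Square root of the average squared prediction error."""
    mse = sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_square_error(actual, predicted)
print(mae, rmse)
```

RMSE weights large errors more heavily than MAE, so the two measures can rank models differently when a few predictions are far off.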

In the marketing dataset, MAE was lowest under the full model, lower than under the stepwise and forward models. This signifies that the full model of the MLR technique gives better predictions than the other models.

Fig 4: MLR-Full Model (Marketing)

Fig 5: MLR-Stepwise Model (Marketing)

Fig 6: MLR-Forward Model (Marketing)

Fig 7: MLR-Full Model (Bank Dataset)

Fig 8: MLR-Forward Model (Bank Dataset)

Fig 9: MLR-Stepwise Model (Bank Dataset)

Fig 10: MLR-Full Model (Parkinson Dataset)

Fig 11: MLR-Forward Model (Parkinson Dataset)

Fig 12: MLR-Stepwise Model (Parkinson Dataset)

Fig 13: Factor Analysis-GLS Model (Marketing Dataset)

Fig 14: Factor Analysis-PCR Model (Marketing Dataset)

Fig 15: Factor Analysis-Maximum Likelihood Model (Marketing Dataset)

Fig 16: Factor Analysis-GLS Model (Bank Dataset)

Fig 17: Factor Analysis-PCR Model (Bank Dataset)

Fig 18: Factor Analysis-Maximum Likelihood Model (Bank Dataset)

Fig 19: Factor Analysis-PCR Model (Parkinson Dataset)

Fig 20: Factor Analysis-GLS Model (Parkinson Dataset)

Fig 21: Factor Analysis-Maximum Likelihood Model (Parkinson Dataset)



Under factor analysis on the marketing dataset, MAE was found to be considerably similar in all models but higher than under the multiple regression techniques; therefore, we can say that factor analysis models generate poorer prediction performance for this kind of dataset.



The diagnostic index of multicollinearity (the condition number) was found to be significantly below 100 under the MLR methods on the marketing dataset, which means there is no scope for high or severe multicollinearity. For the same dataset, the condition number under MLR was found to be lower than under the factor analysis technique. This means factor analysis is the better technique for diagnosing the effect of multicollinearity. But on the marketing dataset, both the factor analysis and MLR techniques showed multicollinearity among the regressors well below the severe level.
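
A condition number of this kind can be sketched as the ratio of the largest to the smallest eigenvalue of the predictors' correlation matrix. The text does not state how the statistical packages computed the index, so this definition is an assumption; the threshold of 100 is the one quoted above, and the two small designs are made up for illustration.

```python
import numpy as np

def condition_number(X: np.ndarray) -> float:
    """Ratio of largest to smallest eigenvalue of the predictors' correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # ascending order for symmetric matrices
    return float(eigenvalues[-1] / eigenvalues[0])

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
independent = np.column_stack([x1, rng.normal(size=500)])
collinear = np.column_stack([x1, x1 + 0.05 * rng.normal(size=500)])  # near-copy of x1

k_independent = condition_number(independent)
k_collinear = condition_number(collinear)
print(k_independent, k_collinear)
```

With nearly independent predictors the ratio stays close to 1, while the near-duplicate column pushes it far above the threshold.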

The F value for the marketing dataset was found to be greater than the critical value for the corresponding degrees of freedom (dF) under both techniques, which signifies that the overall regression model is significant. The stepwise model of the MLR technique had a high F value relative to its dF, which means that the overall significance of the regression model was up to the mark for the stepwise method. The prediction plots of the two techniques on the marketing dataset illustrate the above discussion visually (see Fig. 4 to Fig. 6 and Fig. 13 to Fig. 15).
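
The overall F value is tied to R² through the standard identity F = [R²/(k - 1)] / [(1 - R²)/(n - k)], with k - 1 and n - k degrees of freedom. The sketch below assumes that k counts all estimated parameters including the intercept; the numbers in the example call are hypothetical and are not taken from the study's tables.

```python
def overall_f_statistic(r2: float, n: int, k: int) -> float:
    """F statistic for the joint significance of all slopes.

    n is the number of observations and k the number of estimated
    parameters including the intercept (an assumption of this sketch).
    """
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))

# Hypothetical inputs for illustration only.
f_value = overall_f_statistic(r2=0.5, n=12, k=3)
print(f_value)
```

The computed value is compared with the critical value of the F distribution for (k - 1, n - k) degrees of freedom. For a fixed R², a larger sample gives a larger F.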

5.2 Interpreting Bank Dataset 

For the bank dataset, the explanatory power (R²) of the full model was found to be considerably low due to the residual, whereas under the stepwise and forward models MLR generated satisfactory explanatory power: 56% of the variation in the dependent variable was explained by the independent variables. The other measure of explanatory power (Adj. R²) was also satisfactory for the stepwise and forward models, but not for the full model.

On the other hand, the factor analysis models on the bank dataset generated higher values of both R² and adjusted R², which signifies that the explanatory power of factor analysis on the bank dataset is greater than that of the MLR technique. One striking point is that, among all models of factor analysis and MLR, the full model of MLR generated a very poor R² value, which suggests that the model is not properly specified for this dataset with respect to the magnitude of change.

The MSE criterion for unbiasedness and minimum variance was found to increase for all parameters under both the factor analysis and MLR techniques, but all models of factor analysis showed lower bias and variance than all models of MLR. This means that the parameters of both techniques are significant, but the MLR parameters are significant with high variance.

The RMSE is also satisfactory and up to the mark for factor analysis. Therefore, we can say that the factor analysis parameters have low variance and low bias.

The prediction power of the regression model is also a good fit in all factor analysis models. On the bank dataset, MLR has a higher MAE due to skewness in the test dataset.

The modified coefficient of efficiency was found to be low for the factor analysis models on the bank dataset, since this dataset does not satisfy the central limit theorem due to the constant number of variables; in the MLR models, however, the modified coefficient of efficiency was found to be considerably significant for all models. This may be due to the successful implementation of the central limit theorem.

For the bank dataset, the diagnostic index of multicollinearity was found to be higher under factor analysis than under MLR, which signifies that factor analysis is the better technique for identifying a multicollinearity problem.

The F value for the bank dataset was found to be significant under the MLR models but very low, whereas under factor analysis it was found to be 200 times greater than the critical value, which means the overall significance of all factor analysis models is higher than that of the MLR models. The prediction plots of the two techniques (see Fig. 7 to Fig. 9 and Fig. 16 to Fig. 18) corroborate our discussion.

5.3 Interpreting Parkinson Dataset

For the Parkinson dataset, the forward model of MLR was found to have very low explanatory power, owing to heteroscedasticity in the stochastic error term of the model, whereas the full and stepwise models were found to have 90% explanatory power. In all models of factor analysis, R² was found to be 60%, which is sufficient for satisfactory explanatory power. Moreover, adjusted R² was found to be similar in both techniques, i.e., MLR and factor analysis, because no interpolation was involved.

For the MLR models on the Parkinson dataset, MSE was found to be low and up to the mark, which signifies that MLR is the better technique for extracting structural parameters with unbiasedness and low variance. Factor analysis, on the other hand, was found to have high bias and high variance in extracting the structural parameters of the model.

RMSE was found to be similar in all models of MLR and factor analysis, which signifies the same consideration for unbiasedness and variance.

The prediction power (MAE) of two models of factor analysis, viz. PCR and maximum likelihood, was found to be significant, but the prediction power of the GLS model was found to be considerably higher than that of the PCR and maximum likelihood methods. The prediction power of MLR, on the other hand, was found to differ significantly across all three models: under the stepwise and forward models, prediction power was higher than under the full model.

The central limit theorem condition for model efficiency was found not to be met under MLR, but it was satisfied under factor analysis. Overall, the modified coefficient of efficiency was found to increase under factor analysis.

On the Parkinson dataset, the multicollinearity index was found to be higher under all models of the MLR technique except the forward model. Under factor analysis on the same dataset, this index was found to be lower than under the MLR models. This means MLR is the better technique for diagnosing multicollinearity, particularly with the full and stepwise methods.

The overall significance of the model was found to be higher in two models of MLR, viz. the full and stepwise methods, whereas under factor analysis the overall significance of the regression model was found to be similar in all methods. The forward method of MLR generated a considerably low F value, which means its overall significance is poorer than that of the other models of both techniques. The prediction plots of the two techniques on the Parkinson dataset are given in Fig. 10 to Fig. 12 and Fig. 19 to Fig. 21.

6. Conclusion and Future Work 



The analysis of the linear techniques (MLR and factor analysis) suggests that factor analysis is a considerably better technique than MLR. The principal component model performed well on all datasets of the study, as judged by higher explanatory power, higher goodness of fit, and higher prediction power.

In the diagnosis of multicollinearity, the PCR model of factor analysis was found to be the better model. However, the full model of MLR also produced satisfactory results. All other models of both techniques were found to have high explanatory power but moderate prediction power.

All models are a best fit from the point of view of linearity and unbiasedness, owing to moderate variance, heteroscedasticity, and the distribution of the residual term. Their prediction power was found to be a considerably moderate fit.

From the point of view of structural parameters and overall significance of the regression model, factor analysis was again found to be significantly up to the mark.

From the overall analysis of the regression techniques, we can say that data with high skewness and a large number of structural observations should be estimated with the principal component model of factor analysis. A dataset with high multicollinearity should also be treated through factors/components according to relevance. A small dataset, on the other hand, should be estimated through the full model of multiple regression.

The compatibility of a technique with a particular dataset also depends on the distribution of the residual term of the model for that dataset. In our study, the marketing and Parkinson datasets have a normally distributed residual term, whereas the residual term of the bank dataset was found to be considerably non-normal. The violation of this residual assumption affects the prediction power. To remove the heteroscedastic variance of the residual term, the GLS method should be adopted to estimate the structural parameters with suitable suggested forms of the regression model.

The techniques whose estimators satisfy the BLUE (best linear unbiased estimator) properties for the structural parameter estimates and the stochastic random error term are considered better than others.

The skewness of the predictors and the random term in the linear regression model creates obstacles to satisfying the BLUE properties. Reducing skewness with a more advanced data mining tool and then comparing the performance of the said techniques could be further enlightening, and is an area for further exploration.